Chris Pollett >
Old Classses > |
HW#1 --- last modified February 06 2019 04:15:52..Due date: Feb 22
Files to be submitted: Purpose: To gain some experience with language modeling, yo dust off our programming skills, and to start work on coding an inverted index. To gain experience using an open source crawler. To practice evaluating search results. Related Course Outcomes: The main course outcomes covered by this assignment are: LO1 - Code a basic inverted index capable of performing conjunctive queries. LO6 - Be able to evaluate search results by hand and using TREC eval software. Specification: This homework has three parts, some text-based problems, a write-up of some experiments with Yioop, and a coding part. For each part, I will say file names of where I want you to put stuff. After you are done the homework take all the files you have created put them into a folder Hw1, zip it to make a file Hw1.zip, and submit that. The first file I want you to make is the plain text file problems.txt. In it put your answers to the following questions:
For the second part of the homework, I want you to download and install Yioop as described in class. I want you to conduct an experiment to see how well Yioop's page summarizer is working. Next a want you to pick five Wikipedia pages of your choice. If possible, choose one of the pages to be in a language other than English in which you are fluent. For each page, come up with a list of queries that that page might be able to answer. Make sure your queries only involve words that appear on the given page. Next I want you to create by hand a short summary (say 5 sentences) for each of your Wikipedia pages. Your summaries should be in the same language as those pages. If you are working in a group of more than one person have a different person make the queries from the one who makes the summary. Next, for each of your summaries look at the list of five queries for the page the summary came from. Determine which queries have all their terms still in the summary and make a count of this for each page. For example, you might have wiki page 1 had 3 out 5 queries completely contained in the summary. Log in to the root account of your copy of Yioop, go to the Page Options activety and then the Test Options tab. For each Wikipedia page your choose, copy and paste its HTML source into the Test Options text area and click Test Process Page. Look at the output determine the number of times Yioop got the Human language of the Wikipedia page correct. Then take the summaries you get from each Wikipedia page you chose and calculate as you did for your own human summaries the number of times the summary still contained all of the query terms for each of your queries. Write up all of these experiments and maybe a paragraph conclusion in the file summaries.txt. For the coding part of the homework I want you to create a PHP program that computes a simple text-based, positional inverted index. Your program will be run from the command line using a line of format: php InvertedIndex.php some_dirHere some_dir is the path to some directory in your file system. I will test your code using PHP 5.6.10, PHP is reasonably backward compatible so as long as you don't use any really new features, if it works on your machine it should work on mine. Your code should follow the coding standards in PSR-1 and PSR-2. Your code when run with a line like the above should glob an array of all .txt files in some_dir. Using glob without the GLOB_NOSORT flag should mean this array will be sorted alphabetically. For example, we might get an array ["a.txt", "b.txt", "c.txt"]. Define the document id of a file name to be its position in this array. So c.txt has document id 2. Your program should then read in each text file in turn and use them to create a text-based, positional index which it outputs to stdout. There should be one index entry for each term found in any of these text files and you should output these entries in ascending alphabetical order. An index entry should look like: term:num_docs:freq:(doc_id1, doc_id1_pos1,doc_id1_pos2,...,doc_id1_pos_m_id1),...more records, if exist For example, apple:3:6:(1,1,6),(5,2,4,5),(23,87) In this example, "apple" is the term. It occurs in 3 documents and the total number of times it occurs is 6. This is then followed by a sorted by document id list of three records. The record (1,1,6) indicates the word apple appeared in the document with id 1 as the 1st and 6th token in that document. The record (5,2,4,5) indicates that the word apple appeared in document 5 at locations 2, 4, and 5. Finally, (23,87) indicates that apple appeared in document 23 at location 87. Your program should output a sorted sequence of lines each of which is an index entry. For example, it might begin with lines like: apple:3:6:(1,1,6),(5,2,4,5),(23,87) banana:2:6:(2,3,6,9),(4,7,18,20) cucumber:5:5:(0,16),(2,16),(3,16),(4,16),(5,16) ... If you have doubts on whether your code is outputting things as I want them, feel free to demo it to me before the homework is due. Point Breakdown
|